StreamInterface: prevent socket/reader-thread leak on handshake failure#918

Open
nightjoker7 wants to merge 1 commit into meshtastic:master from nightjoker7:fix-stream-handshake-leak

Conversation

@nightjoker7

Problem

StreamInterface.__init__ calls self.connect() which spawns the reader thread and opens the stream, then calls self.waitForConfig() which blocks for the config handshake. If either raises — handshake timeout on a busy/loaded node is the common case — the partially-constructed object propagates the exception up, but:

  • the reader thread spawned in connect() keeps running
  • the underlying stream (socket on TCPInterface, serial.Serial on SerialInterface, etc.) stays open
  • the caller never receives a reference to self, so they cannot call close() themselves

In a caller that retries on failure (passive loggers, reconnect loops, monitoring daemons), the thread + stream leak compounds with every retry. On the firmware side, every re-attempted handshake triggers a fresh config-dump + NodeDB-dump on the node, so the client-side leak directly inflates heap pressure on the radio.

Fix

Two small changes in meshtastic/stream_interface.py:

  1. __init__: wrap connect() + waitForConfig() in try/except that calls self.close() on any exception before re-raising. This ensures the reader thread and the stream are released even when __init__ cannot return normally.

  2. close(): guard self._rxThread.join() against the RuntimeError raised when close() runs from a failed __init__ before connect() has spawned the reader thread.
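
The two changes above can be sketched with a minimal stand-in class (not the real StreamInterface — only the names connect(), waitForConfig()-style failure, close() and _rxThread mirror it; everything else is simplified for illustration):

```python
import contextlib
import threading
import time

class StreamLike:
    """Stand-in illustrating the fix: clean up in __init__ on failure,
    and tolerate joining a never-started reader thread in close()."""

    def __init__(self, fail_handshake=False):
        self._wantExit = False
        self._rxThread = threading.Thread(target=self._reader, daemon=True)
        self.stream_open = False
        try:
            self.connect()
            if fail_handshake:  # stands in for waitForConfig() timing out
                raise TimeoutError("config handshake timed out")
        except Exception:
            # The caller never gets a reference to self, so release the
            # reader thread and the stream here before re-raising.
            with contextlib.suppress(Exception):
                self.close()
            raise

    def connect(self):
        self.stream_open = True  # stands in for opening the socket/serial port
        self._rxThread.start()

    def _reader(self):
        while not self._wantExit:
            time.sleep(0.01)

    def close(self):
        self._wantExit = True
        if self._rxThread is not threading.current_thread():
            try:
                self._rxThread.join(timeout=2.0)
            except RuntimeError:
                pass  # thread was never started: connect() failed early
        self.stream_open = False
```

On a failed handshake the constructor still raises, but the reader thread is joined and the stream released before the exception leaves __init__.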

How it was found

Hit this leak on a fleet-monitoring daemon retrying TCPInterface(host) every 15 seconds against a Station-G2 node with NodeDB at MAX_NUM_NODES=250. Each handshake timed out, each retry leaked a reader thread + socket, and the compound reconnect pressure on the node correlated with cascading reboots. After applying this fix client-side (together with firmware null-checks in meshtastic/firmware#10226 and #10229), the same deployment has been stable.

Minimal repro: run TCPInterface('<host that accepts TCP but does not respond to meshtastic protocol>') in a retry loop and watch threading.active_count() climb.
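
The climb can also be demonstrated without a radio or the meshtastic package. The sketch below (hypothetical names, plain sockets) stands up a silent TCP peer and a client that, like the pre-fix __init__, abandons its reader thread and socket on handshake timeout:

```python
import socket
import socketserver
import threading

class SilentHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Accept the TCP connection but never speak the protocol,
        # like a node too loaded to answer the config request.
        try:
            self.request.recv(1)
        except OSError:
            pass

def leaky_connect(port):
    """Mimics the pre-fix failure path: open a stream, start a reader
    thread, then raise without cleaning either up."""
    sock = socket.create_connection(("127.0.0.1", port))
    reader = threading.Thread(target=lambda: sock.recv(1), daemon=True)
    reader.start()
    raise TimeoutError("handshake timed out")

server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SilentHandler)
server.daemon_threads = True
threading.Thread(target=server.serve_forever, daemon=True).start()

before = threading.active_count()
for _ in range(5):
    try:
        leaky_connect(server.server_address[1])
    except TimeoutError:
        pass
leaked = threading.active_count() - before
print(f"{leaked} extra threads after 5 retries")
```

Each retry strands at least one blocked reader thread, so the count only ever grows — the same shape as the real leak.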

Backwards compatibility

Happy-path behavior is unchanged. On the failure path, callers that previously received an exception from __init__ and did nothing will now see the same exception but without the orphaned thread + stream. Callers that relied on the thread staying alive after a failed __init__ (unlikely — they had no reference to it) would notice, but that would be an unusual pattern.

Related

…re in __init__

If connect() or waitForConfig() raises during __init__ (handshake timeout,
bad stream, config error), the reader thread started by connect() keeps
running and the underlying stream/socket stays open — but the caller never
receives a reference to the half-initialized instance, so they cannot call
close() themselves. The leak compounds on every retry from a caller's
reconnect loop.

Fix: wrap connect() + waitForConfig() in try/except; call self.close() on
any exception before re-raising. Also guard close() against RuntimeError
from joining an unstarted reader thread (happens when close() runs from
a failed __init__ before connect() could spawn it).

Discovered while debugging a real-world Meshtastic firmware crash where
a passive logger's retrying TCPInterface() calls against a node with
250-entry NodeDB produced a reconnect storm — every retry triggered a
full config+NodeDB dump on the node, compounding heap pressure, which
then exposed null-deref bugs in Router::perhapsDecode / MeshService
(firmware side fixed in meshtastic/firmware#10226 and #10229). The
client-side leak is independent of those firmware bugs and worth fixing
on its own.
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a resource-leak failure mode in StreamInterface where exceptions during initial protocol handshake (connect()/waitForConfig()) could leave a running reader thread and an open underlying transport (serial/TCP), even though the constructor raised before returning an instance to the caller.

Changes:

  • Wrap connect() + waitForConfig() in StreamInterface.__init__ with exception handling that attempts to close() before re-raising.
  • Make StreamInterface.close() tolerate RuntimeError from joining an unstarted reader thread.

Comment on lines +156 to +157
# Nothing to join; safe to ignore.
pass

Copilot AI Apr 23, 2026


If connect() fails before the reader thread is started (e.g., serial write raises before _rxThread.start()), this RuntimeError path will skip joining but also won’t close the underlying stream. For SerialInterface (which opens the port before calling StreamInterface.__init__), this means the serial port can remain open even though close() was invoked. Consider explicitly closing self.stream (and setting it to None) when the thread was never started / is not alive, instead of just pass.

Suggested change
-            # Nothing to join; safe to ignore.
-            pass
+            # In that case the reader thread cannot run _disconnected(),
+            # so close the stream here to avoid leaking the resource.
+            if self.stream is not None:
+                with contextlib.suppress(Exception):
+                    self.stream.close()
+                self.stream = None

Comment on lines 150 to +153
         if self._rxThread != threading.current_thread():
-            self._rxThread.join()  # wait for it to exit
+            try:
+                self._rxThread.join()  # wait for it to exit
+            except RuntimeError:
Copy link

Copilot AI Apr 23, 2026


join() without a timeout can block indefinitely if the reader thread is stuck in a blocking _readBytes() (e.g., TCPInterface uses blocking socket.recv() with no timeout). With the new __init__ exception handler calling self.close(), a handshake timeout against a silent TCP peer can hang here and never re-raise the original exception. Consider using a bounded join (and letting transport-specific close() logic interrupt the read), or otherwise ensuring blocking reads are interrupted before joining.
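
One way to make that bound concrete, sketched with a plain socket pair standing in for the transport (in the real code the shutdown/close would live in the transport-specific close() logic):

```python
import socket
import threading

a, b = socket.socketpair()

def reader():
    # Stands in for the reader loop blocked in _readBytes() against a
    # silent peer: recv() never returns on its own.
    try:
        while a.recv(1):
            pass
    except OSError:
        pass  # transport shut down underneath us

t = threading.Thread(target=reader, daemon=True)
t.start()

# An unbounded t.join() here could hang forever. Instead, interrupt the
# blocking recv() by shutting the transport down, then join with a bound.
a.shutdown(socket.SHUT_RDWR)
t.join(timeout=5.0)
reader_exited = not t.is_alive()
print("reader exited:", reader_exited)
a.close()
b.close()
```

Shutting down the socket wakes the blocked recv() (it returns EOF), so the bounded join returns promptly and close() can never wedge the caller.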

Comment on lines +65 to +78
        try:
            self.connect()
            if not noProto:
                self.waitForConfig()
        except Exception:
            # Handshake failed (timeout, config error, bad stream). The caller
            # never receives a reference to this half-initialized instance, so
            # they cannot call close() themselves. If we don't clean up here,
            # the reader thread (already started by connect()) keeps running
            # and the underlying stream/socket leaks — the leak compounds on
            # every retry from the caller's reconnect loop.
            with contextlib.suppress(Exception):
                self.close()
            raise

Copilot AI Apr 23, 2026


New failure-path behavior (calling close() from __init__ when connect() / waitForConfig() raises) is important for leak prevention but currently isn’t covered by unit tests. Since there are existing tests for stream_interface.py, consider adding a test that forces connect() or waitForConfig() to raise and asserts cleanup occurs (e.g., close() is invoked and doesn’t raise when the thread wasn’t started).
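
Such a test might look like this (a sketch against a stand-in class; a real test would import StreamInterface and stub connect()/waitForConfig() the same way):

```python
from unittest import mock

class FakeIface:
    """Mirrors only the patched __init__ control flow of StreamInterface."""

    def __init__(self):
        try:
            self.connect()
            self.waitForConfig()
        except Exception:
            try:
                self.close()
            except Exception:
                pass
            raise

    def connect(self):
        pass

    def waitForConfig(self):
        raise TimeoutError("no config within timeout")

    def close(self):
        pass

def test_close_called_on_handshake_failure():
    with mock.patch.object(FakeIface, "close") as closed:
        try:
            FakeIface()
            raise AssertionError("expected TimeoutError")
        except TimeoutError:
            pass
    closed.assert_called_once()
    return True

passed = test_close_called_on_handshake_failure()
print("passed:", passed)
```

The test asserts two things at once: the original exception still propagates, and close() ran exactly once on the failure path.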

Copilot generated this review using guidance from repository custom instructions.

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.52%. Comparing base (cec79a7) to head (9011baa).

Files with missing lines Patch % Lines
meshtastic/stream_interface.py 46.15% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   61.57%   61.52%   -0.06%     
==========================================
  Files          25       25              
  Lines        4448     4457       +9     
==========================================
+ Hits         2739     2742       +3     
- Misses       1709     1715       +6     
Flag Coverage Δ
unittests 61.52% <46.15%> (-0.06%) ⬇️

